Final Project for Twitter Data Analytics

Date: May 7th
Author: Jingyi Li
GitHub ID: ulrica221

This project analyzes hot topics on Twitter during the COVID-19 period. We cluster the tweets and then draw automatically labeled graphs from the analysis. The clustering method we originally used is k-means, with an elbow test to choose k. The shortcoming of this method is that sometimes the elbow cannot be seen clearly, and someone has to set k manually after checking the elbow graph. My focus is on replacing the elbow test with a method that selects k automatically.
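To make the manual step concrete, here is a minimal sketch of an elbow test on toy data. It is not the project's notebook code: the `toy` matrix is a hypothetical stand-in for the tweet vectors, and `ks`, `wss`, and `nstart = 10` are choices made only for this illustration.

```r
# Toy data: three well-separated 2-D blobs (hypothetical stand-in for tweet vectors)
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
print(round(wss, 1))

# The "elbow" is the k after which the curve stops dropping sharply.
# A person has to read it off the plot, which is the manual step.
if (interactive()) {
  plot(ks, wss, type = "b", xlab = "number of clusters",
       ylab = "total within-cluster sum of squares", frame.plot = FALSE)
}
```

On clean toy data like this the bend is easy to see, but on real tweet vectors the curve is often smooth, which is why an automatic criterion is attractive.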

The method I use is the R function silhouette (from the cluster package). I wrote a function, silhouette_coe, that calculates the mean silhouette score for a clustering with i clusters. The higher the mean silhouette score, the better the clustering. I then compute the score for every candidate number of clusters, store the scores in a vector, and pick the number of clusters with the maximum score as k. The code is:

library(cluster)  # provides silhouette()

silhouette_coe <- function(i) {  # input: number of clusters
  km <- kmeans(tweet.vectors.matrix, centers = i)
  # arguments: cluster assignments and the distance matrix from dist()
  score <- silhouette(km$cluster, dist(tweet.vectors.matrix))
  mean(score[, 3])  # silhouette() returns an n x 3 matrix; column 3 is the silhouette width
}

fc <- 2   # starting number of clusters
nc <- 40  # ending number of clusters
i <- fc:nc
avg_score <- sapply(i, silhouette_coe)  # collect the scores in a vector to find the max
plot(i, avg_score, type = "b", xlab = "number clusters", ylab = "Average Silhouette-Scores", frame = FALSE)

# here we find the k with the max average silhouette score
# avg_score[1] corresponds to fc clusters, so index back into fc:nc
k <- (fc:nc)[which.max(avg_score)]

With this in place, for a test case of 5,000 tweets the silhouette method suggested 21 clusters in my first notebook version. The answer may change from run to run, because it depends on the data points that are sampled and k-means starts from random centers. The silhouette score favors more homogeneous clusters, which is what lets it give an accurate result.
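The selection described above can be tried end to end on a small example. This is only a sketch under stated assumptions: `toy`, `toy_dist`, `mean_silhouette`, `candidates`, and `best_k` are hypothetical names, the data are synthetic blobs rather than tweet vectors, and `nstart = 10` is added here to make k-means less sensitive to its random start.

```r
library(cluster)  # provides silhouette()

# Toy data: three well-separated 2-D blobs (hypothetical stand-in for tweet vectors)
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)
toy_dist <- dist(toy)  # pairwise distances, computed once

# Mean silhouette width of a k-means clustering with k clusters
mean_silhouette <- function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, toy_dist)[, 3])  # column 3 holds the silhouette widths
}

candidates <- 2:8
scores <- sapply(candidates, mean_silhouette)

# Pick the candidate with the highest mean silhouette width
best_k <- candidates[which.max(scores)]
print(best_k)
```

Indexing `candidates` with `which.max(scores)` keeps the mapping between vector position and number of clusters explicit, so the result stays correct if the starting value of the range changes.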

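Running a case of 10,000 tweets takes me about an hour. The silhouette computation scales poorly, since it needs the pairwise distances between all points, and their number grows quadratically with the number of tweets. It is therefore not practical for large cases.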
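A rough way to see the quadratic growth is to count the pairwise distances that dist() has to store. The helper `n_pairs` below is a hypothetical name used only for this illustration, and the memory figure assumes 8 bytes per stored distance.

```r
# Number of pairwise distances stored by dist() for n points: n * (n - 1) / 2
n_pairs <- function(n) n * (n - 1) / 2

print(n_pairs(5000))   # distances for the 5,000-tweet case
print(n_pairs(10000))  # distances for the 10,000-tweet case

# Check the formula against dist() on a small random matrix (20 points)
small <- matrix(rnorm(40), nrow = 20)
stopifnot(length(dist(small)) == n_pairs(20))

# Approximate size of one distance object in megabytes, at 8 bytes per value
print(n_pairs(10000) * 8 / 1e6)
```

Doubling the number of tweets roughly quadruples the number of distances. Because silhouette_coe calls dist() inside the function, the distance matrix is also rebuilt for every candidate number of clusters; computing it once outside the function and reusing it is one possible saving, though it does not remove the quadratic cost itself.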

# query start date/time (inclusive)
rangestart <- "2020-04-01 00:00:00"
# query end date/time (exclusive)
rangeend <- "2020-04-16 00:00:00" # originally 4-16
# query semantic similarity phrase (choose one of these examples or enter your own)
#semantic_phrase <- "Elementary school students are not coping well with distance learning."
#semantic_phrase <- "How do you stay at home when you are homeless?"
#semantic_phrase <- "My wedding has been postponed due to the coronavirus."
#semantic_phrase <- "I lost my job because of COVID-19. How am I going to be able to make rent?"
#semantic_phrase <- "I am diabetic and out of work because of coronavirus. I am worried I won't be able to get insulin without insurance."
#semantic_phrase <- "There is going to be a COVID-19 baby boom..."
semantic_phrase <- ""
# return results in chronological order or as a random sample within the range
# (ignored if semantic_phrase is not blank)
random_sample <- TRUE
# number of results to return (max 10,000)
resultsize <- 5000  #originally 10000
####TEMPORARY SETTINGS####
# number of high level clusters (temporary until automatic selection implemented)
k <- if (semantic_phrase=="") 15 else 5
# number of subclusters per high level cluster (temporary until automatic selection implemented)
cluster.k <- 8
# show/hide extra info (temporary until tabs are implemented)
show_original_subcluster_plots <- FALSE
show_regrouped_subcluster_plots <- TRUE
show_word_freqs <- FALSE
show_center_nn <- FALSE

## Warning: did not converge in 10 iterations


## [1] 38
## [1] "Subclustering cluster 1 ..."
## [1] "Subclustering cluster 2 ..."
## [1] "Subclustering cluster 3 ..."
## [1] "Subclustering cluster 4 ..."
## [1] "Subclustering cluster 5 ..."
## [1] "Subclustering cluster 6 ..."
## [1] "Subclustering cluster 7 ..."
## [1] "Subclustering cluster 8 ..."
## [1] "Subclustering cluster 9 ..."
## [1] "Subclustering cluster 10 ..."
## [1] "Subclustering cluster 11 ..."
## [1] "Subclustering cluster 12 ..."
## [1] "Subclustering cluster 13 ..."
## [1] "Subclustering cluster 14 ..."
## [1] "Subclustering cluster 15 ..."
## [1] "Subclustering cluster 16 ..."
## [1] "Subclustering cluster 17 ..."
## [1] "Subclustering cluster 18 ..."
## [1] "Subclustering cluster 19 ..."
## [1] "Subclustering cluster 20 ..."
## [1] "Subclustering cluster 21 ..."
## [1] "Subclustering cluster 22 ..."
## [1] "Subclustering cluster 23 ..."
## [1] "Subclustering cluster 24 ..."
## [1] "Subclustering cluster 25 ..."
## [1] "Subclustering cluster 26 ..."
## [1] "Subclustering cluster 27 ..."
## [1] "Subclustering cluster 28 ..."
## [1] "Subclustering cluster 29 ..."
## [1] "Subclustering cluster 30 ..."
## [1] "Subclustering cluster 31 ..."
## [1] "Subclustering cluster 32 ..."
## [1] "Subclustering cluster 33 ..."
## [1] "Subclustering cluster 34 ..."
## [1] "Subclustering cluster 35 ..."
## [1] "Subclustering cluster 36 ..."
## [1] "Subclustering cluster 37 ..."
## [1] "Subclustering cluster 38 ..."
## [1] "Plotting cluster 1 ..."
## [1] "Plotting cluster 2 ..."
## [1] "Plotting cluster 3 ..."
## [1] "Plotting cluster 4 ..."
## [1] "Plotting cluster 5 ..."
## [1] "Plotting cluster 6 ..."
## [1] "Plotting cluster 7 ..."
## [1] "Plotting cluster 8 ..."
## [1] "Plotting cluster 9 ..."
## [1] "Plotting cluster 10 ..."
## [1] "Plotting cluster 11 ..."
## [1] "Plotting cluster 12 ..."
## [1] "Plotting cluster 13 ..."
## [1] "Plotting cluster 14 ..."
## [1] "Plotting cluster 15 ..."
## [1] "Plotting cluster 16 ..."
## [1] "Plotting cluster 17 ..."
## [1] "Plotting cluster 18 ..."
## [1] "Plotting cluster 19 ..."
## [1] "Plotting cluster 20 ..."
## [1] "Plotting cluster 21 ..."
## [1] "Plotting cluster 22 ..."
## [1] "Plotting cluster 23 ..."
## [1] "Plotting cluster 24 ..."
## [1] "Plotting cluster 25 ..."
## [1] "Plotting cluster 26 ..."
## [1] "Plotting cluster 27 ..."
## [1] "Plotting cluster 28 ..."
## [1] "Plotting cluster 29 ..."
## [1] "Plotting cluster 30 ..."
## [1] "Plotting cluster 31 ..."
## [1] "Plotting cluster 32 ..."
## [1] "Plotting cluster 33 ..."
## [1] "Plotting cluster 34 ..."
## [1] "Plotting cluster 35 ..."
## [1] "Plotting cluster 36 ..."
## [1] "Plotting cluster 37 ..."
## [1] "Plotting cluster 38 ..."
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
